feat(AGX1-274): record task creator identity and FGAC migration safety#246
Open
asherfink wants to merge 2 commits into
Open
feat(AGX1-274): record task creator identity and FGAC migration safety#246asherfink wants to merge 2 commits into
asherfink wants to merge 2 commits into
Conversation
13fe4b2 to
7486e5a
Compare
8 tasks
7486e5a to
b9cb26b
Compare
4 tasks
ad1e980 to
3a06be8
Compare
dm36
added a commit
that referenced
this pull request
May 28, 2026
…AGENT_API_KEYS_DUAL_WRITE flag Mirrors the AGX1-274 task dual-write pattern (PR #246) for agent_api_keys. - Adds creator_user_id / creator_service_account_id / spark_authz_zedtoken columns to agent_api_keys, with CHECK constraint and concurrent indexes. - On create, when FGAC_AGENT_API_KEYS_DUAL_WRITE is enabled for the caller's account, calls authorization_service.grant(AgentexResource.api_key(id)) BEFORE the Postgres write. Grant failure aborts the create. - On delete, best-effort revoke after the Postgres delete. Failures are logged but do not block the delete. - Adds AgentexResourceType.api_key and AgentexResource.api_key(...) factory. - Creates src/utils/feature_flags.py with both FGAC_TASKS_DUAL_WRITE and FGAC_AGENT_API_KEYS_DUAL_WRITE (file does not exist on main yet; if PR #246 lands first this becomes a rebase concern). Structural divergence from tasks: agent_api_keys have no service layer, so the dual-write logic lives in AgentAPIKeysUseCase rather than a separate service. This keeps the call site simple and avoids inventing a new layer. Route layer (read-side auth checks) is out of scope; that's PR B (AGX1-273). agentex-auth spark_mapping.py update is a sibling-repo concern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
asherfink
added a commit
that referenced
this pull request
May 28, 2026
…n creation Adds two nullable creator-audit columns to the `tasks` table — `creator_user_id` and `creator_service_account_id` — populated from the request principal in `AgentTaskService.create_task`. A CHECK constraint `ck_tasks_at_most_one_creator` enforces that at most one of the two is set; partial indexes back future "tasks created by X" lookups. Online migration: the CHECK is added `NOT VALID` then `VALIDATE`d separately so the brief ACCESS EXCLUSIVE lock doesn't have to wait on an existence scan. `tasks` is a high-write table; a vanilla CHECK addition would queue behind in-flight transactions and block readers until released. Indexes use `CREATE INDEX CONCURRENTLY` inside `autocommit_block`. Best-effort attribution: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs `AgentTaskService` without `request.state.principal_context`) leave both columns NULL. The CHECK constraint allows both-NULL, and an integration test exercises the no-resolvable-creator path. These columns are how the AGX1-291 operator runbook identifies orphan rows for backfill when the dual-write call sites added in the next commit fail under load. Part of the AGX1-264 stack: scaleapi/scaleapi NEW2 (per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth routing + cancel) → this PR → #249 (per-RPC route migration). Two commits land together in #246; this one is the schema/audit change and is independent of the dual-write call sites.
3a06be8 to
f7910e3
Compare
asherfink
added a commit
that referenced
this pull request
May 28, 2026
Rewires the operation literal sent to agentex-auth on task RPC routes so each RPC checks the permission that actually matches its side effect, instead of using `execute` everywhere: - `MESSAGE_SEND` / `EVENT_SEND` → `update` - `TASK_CANCEL` → `cancel` - `TASK_CREATE` stays `create` - Unknown `AgentRPCMethod` values now raise `NotImplementedError` rather than falling through authz-free (defense-in-depth: a new RPC must be explicitly wired before it can dispatch). The same `execute → update` swap is applied across `messages.py`, `checkpoints.py`, and `states.py` so the editor role can perform routine mutations without needing owner. The task SpiceDB schema defines `permission update = (editor + owner) & internal_tenant_gate`, so leaving these on `execute` (owner-only) would lock editors out of normal flows. Adds `check_task_or_collapse_to_404` in `src/utils/task_authorization.py` and routes every task-resource denial path through it: path id, query id, body id, and the name surface in `authorization_shortcuts.py`. `tasks.name` is globally unique, so a 403/404 split on the name route would let any authenticated caller probe the whole system for task existence — collapsing both denial cases into 404 closes that leak at the cost of an in-tenant UX regression on permission-gap updates (tracked under AGX1-290). The `MESSAGE_SEND` task-name branch is restructured to `try/else`: a denied update on an existing task must surface as 404 and NOT fall through to the create check, which would promote "denied update" into create access. Cross-repo wire dependency: the `update` and `cancel` literals must resolve against the existing OWNER grant in SGP's task permission schema before this deploys, otherwise every in-flight agent's RPCs break at deploy time. Held behind that verification. Part of the AGX1-264 stack: scaleapi/scaleapi NEW2 (per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth routing + cancel) → #246 (task FGAC dual-write + audit columns) → this PR.
…n creation Adds two nullable creator-audit columns to the `tasks` table — `creator_user_id` and `creator_service_account_id` — populated from the request principal in `AgentTaskService.create_task`. A CHECK constraint `ck_tasks_at_most_one_creator` enforces that at most one of the two is set; partial indexes back future "tasks created by X" lookups. Online migration: the CHECK is added `NOT VALID` then `VALIDATE`d separately so the brief ACCESS EXCLUSIVE lock doesn't have to wait on an existence scan. `tasks` is a high-write table; a vanilla CHECK addition would queue behind in-flight transactions and block readers until released. Indexes use `CREATE INDEX CONCURRENTLY` inside `autocommit_block`. Best-effort attribution: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs `AgentTaskService` without `request.state.principal_context`) leave both columns NULL. The CHECK constraint allows both-NULL, and an integration test exercises the no-resolvable-creator path. These columns are how the AGX1-291 operator runbook identifies orphan rows for backfill when the dual-write call sites added in the next commit fail under load. Part of the AGX1-264 stack: scaleapi/scaleapi#145000 (FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi sgp-agentex-cancel-enum (cancel enum on SGP backend) -> scaleapi/agentex#353 (agentex-auth per-account routing + cancel) -> this PR -> #249 (per-RPC route migration). Two commits land together in #246; this one is the schema/audit change and is independent of the dual-write call sites in the next commit.
f7910e3 to
31d3d40
Compare
asherfink
added a commit
that referenced
this pull request
May 28, 2026
Rewires the operation literal sent to agentex-auth on task RPC routes so each RPC checks the permission that actually matches its side effect, instead of using `execute` everywhere: - `MESSAGE_SEND` / `EVENT_SEND` -> `update` - `TASK_CANCEL` -> `cancel` - `TASK_CREATE` stays `create` - Unknown `AgentRPCMethod` values now raise `NotImplementedError` rather than falling through authz-free (defense-in-depth: a new RPC must be explicitly wired before it can dispatch). The same `execute -> update` swap is applied across `messages.py`, `checkpoints.py`, and `states.py` so the editor role can perform routine mutations without needing owner. The task SpiceDB schema defines `permission update = (editor + owner) & internal_tenant_gate`, so leaving these on `execute` (owner-only) would lock editors out of normal flows. Adds `check_task_or_collapse_to_404` in `src/utils/task_authorization.py` and routes every task-resource denial path through it: path id, query id, body id, and the name surface in `authorization_shortcuts.py`. `tasks.name` is globally unique, so a 403/404 split on the name route would let any authenticated caller probe the whole system for task existence — collapsing both denial cases into 404 closes that leak at the cost of an in-tenant UX regression on permission-gap updates (tracked under AGX1-290). The `MESSAGE_SEND` task-name branch is restructured to `try/else`: a denied update on an existing task must surface as 404 and NOT fall through to the create check, which would promote "denied update" into create access. Cross-repo wire dependency: the `update` and `cancel` literals must resolve against the existing OWNER grant in SGP's task permission schema before this deploys. `update` is already in SGP's `AgentexOperation` enum; `cancel` is added by scaleapi/scaleapi sgp-agentex-cancel-enum. Held behind that PR shipping everywhere. Part of the AGX1-264 stack: scaleapi/scaleapi#145000 (FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi sgp-agentex-cancel-enum (cancel enum on SGP backend) -> scaleapi/agentex#353 (agentex-auth per-account routing + cancel) -> #246 (task FGAC dual-write + audit columns) -> this PR.
|
Some tests are failing: |
…TE flag
Wires `register_resource` / `deregister_resource` into
`AgentTaskService.create_task` / `delete_task`, gated by a new
`FGAC_TASKS_DUAL_WRITE` env-var (off by default; resolved at
DI-resolve time so mid-process flips are intentionally invisible —
rollout assumes a redeploy cycles pods).
- `create_task`: after the Postgres row persists, register the
task in the authorization graph with `parent=agent` so the
tenant + owner + parent_agent tuples are written atomically.
- `delete_task`: pre-resolves the task id by name before the
Postgres delete (lookup-after-delete would race), then
deregisters once the row is gone. The name-lookup
`ItemDoesNotExist` is swallowed so the subsequent `delete()`
surfaces its own native error — flipping the flag must not
change the error contract for missing tasks.
- Both call sites share `_dual_write_with_retry(op_name, do_call,
task_id)`, which retries transient
`AuthenticationServiceUnavailableError` /
`AuthenticationGatewayError` with exponential backoff + jitter
(3 retries -> 4 attempts max). Mirrors
`agents_acp_use_case.grant_with_retry`, but with no `fail_task`
fallback: the Postgres row is the durable record and orphan auth
tuples are preferable to losing the task. The AGX1-291 operator
runbook covers backfill using the creator-audit columns added in
the parent commit.
- Emits `task_fgac_dual_write.{attempt,success,retry,failure}`
statsd counters (tagged `op:register|deregister` and
`exception_class` on failure) — the rollout signal for the
FGAC_TASKS_DUAL_WRITE flip dashboard.
The `Port` interface gains `register_resource` /
`deregister_resource`, and the agentex-auth proxy adapter calls
`POST /v1/authz/register` and `POST /v1/authz/deregister`. The
endpoints themselves already live on agentex-auth `main` via #354;
per-account routing across them is set by scaleapi/agentex#353.
Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
31d3d40 to
a9bc09a
Compare
Author
|
updated, tests all pass now :) |
jenniechung
approved these changes
May 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related work
Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.
This change is part of a 5-PR stack across 3 repos.
scaleapi/scaleapi#144783✅ mergedAction.CANCELFGAC_AGENTEX_AUTH_SPARKflagcancelto SGP'sAgentexOperationenum + role mapcancelopSummary
Commit 1 — passive audit columns
creator_user_idandcreator_service_account_idcolumns to thetaskstable, populated from the request principal inAgentTaskService.create_task. Best-effort (NULLable; see caveat below).CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL))constraint (ck_tasks_at_most_one_creator) to enforce at-most-one creator type at the DB layer.ix_tasks_creator_user_idandix_tasks_creator_service_account_id(CREATE INDEX CONCURRENTLY ... WHERE column IS NOT NULL) for future "tasks created by X" lookups.Commit 2 — FGAC dual-write call sites + flag
FGAC_TASKS_DUAL_WRITEenv-var flag, injected intoAgentTaskServicevia FastAPI DI. Gates the dual-write behavior end-to-end.create_taskcallsregister_resource(task, parent=agent)on the authorization service after the Postgres row persists, so the task is registered withtenant + owner + parent_agenttuples atomically (via agentex-auth's/v1/authz/register, landed on agentex-authmainvia #354).delete_taskcallsderegister_resource(task)after the Postgres delete. Pre-resolves the task id by name first so the post-delete deregister doesn't race the lookup._dual_write_with_retry(op_name, do_call, task_id)helper. RetriesAuthenticationServiceUnavailableError/AuthenticationGatewayErrorwith exponential backoff + jitter (3 retries → 4 total attempts max), mirroringAgentsACPUseCase.grant_with_retry. Non-transient exceptions are not retried.task_fgac_dual_write.attempt|success|retry|failure) tagged withop:register|deregisterandexception_class:<name>on failure — the rollout signal for AGX1-291's operator runbook.parent=kwarg matches agentex-auth's wire schema (set by #354); the proxy adapter serializes it to the JSON fieldparent.Migration safety
ALTER TABLE ... ADD CONSTRAINT ... NOT VALID+ALTER TABLE ... VALIDATE CONSTRAINT— splits the operation so the briefACCESS EXCLUSIVElock doesn't have to wait on an existence scan.tasksis high-write; a CHECK addition withoutNOT VALIDwould queue behind in-flight transactions and block readers until released.CREATE INDEX CONCURRENTLYin anautocommit_block.a1f73ada66c5(add_task_creator_columns).down_revisionis6c942325c828(adding_task_cleaned_at, the current alembic head on main);migration_history.txtregenerated viaalembic history. The ORM-sideCheckConstraintinorm.pymatches the DB-side (same constraint name + predicate).Rollout
register_resourceandderegister_resourcefire on create/delete. If they fail after retries, the Postgres row is still the durable record — orphan auth tuples can be cleaned up out of band per the AGX1-291 operator runbook using the creator-audit columns to identify them.Audit-trail caveat
Creator attribution is best-effort: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs
AgentTaskServicewithoutrequest.state.principal_context) leave both columns NULL. The CHECK constraint allows both-NULL, andtest_no_resolvable_creator_leaves_both_columns_nullexercises this path.What changed
database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py(new): NOT VALID-pattern migration.down_revision = "6c942325c828".src/adapters/orm.py: declarativeCheckConstraintmirroring the DB constraint.src/domain/entities/tasks.py: new optional fields onTaskEntity.src/domain/services/task_service.py:_principal_fieldhelper (handles dict-vs-pydantic principal shape from the authn proxy).create_taskreadscreator_user_id/creator_service_account_idfrom principal context.AgentTaskService.__init__takesdual_write_enabled: DEnvironmentVariable(EnvVarKeys.FGAC_TASKS_DUAL_WRITE)._dual_write_with_retry(op_name, do_call, task_id)keyed by op name; reused from both call sites.parent=to match agentex-auth's wire schema.src/adapters/authorization/adapter_agentex_authz_proxy.py: forwards to agentex-auth's/v1/authz/register(JSON body fieldparent) and/deregister.src/adapters/authorization/port.py:register_resource(... parent: AgentexResource | None = None).src/config/environment_variables.py: newFGAC_TASKS_DUAL_WRITEkey.test_task_audit_columns.py— testcontainers Postgres integration tests for the audit columns (creator population, mutual-exclusion CHECK, both-NULL allowed).test_task_fgac_dual_write.py— register-on-create, deregister-on-delete, flag-off skip, transient retry-and-succeed (both register and deregister), retry exhaustion propagating with the Postgres row preserved, and the name-routeItemDoesNotExistswallow. Assertions on theparentkwarg align with agentex-auth's wire schema.dual_write_enabledconstructor parameter.Test plan
migration_lint.py— clean.test_task_audit_columns.py— passes locally via testcontainers.test_task_fgac_dual_write.py— collects cleanly; runs in CI integration suite.\d tasksshows new columns + constraint + indexes; flip flag on for one account, confirmtask_fgac_dual_write.successfires.Greptile Summary
This PR adds task creator audit columns (
creator_user_id,creator_service_account_id) to thetaskstable and implements an operator-gated FGAC dual-write path that registers/deregisters tasks with the agentex-auth authorization graph on create/delete.create_task; aCHECKconstraint enforces at-most-one creator type; partialCONCURRENTLYindexes scope to non-NULL rows only.FGAC_TASKS_DUAL_WRITE(off by default);_dual_write_with_retryprovides exponential-backoff retry for transient auth errors, emitting Datadog counters for rollout observability; the Postgres row is the durable record regardless of auth-graph outcome._UNSETsentinel refactor: Replaces the...sentinel inAuthorizationServicemethods with a named_UNSETobject; also addsregister_resource/deregister_resourcedelegating to the new gateway methods.Confidence Score: 5/5
Safe to merge: flag defaults to off, no behavior change on deploy; audit columns are NULLable and additive; migration uses CONCURRENTLY indexes and the NOT VALID/VALIDATE split inside an autocommit_block.
The dual-write path is completely gated and does not affect any existing code path when the flag is off. The Postgres row is the durable record regardless of auth-graph outcome, and the retry/metric pattern mirrors the existing codebase. Tests are comprehensive, covering both call sites, flag-off, retry-and-succeed, and exhaustion scenarios.
No files require special attention beyond what has already been discussed in earlier review threads.
Important Files Changed
Sequence Diagram
sequenceDiagram participant C as API Caller participant TS as AgentTaskService participant TR as TaskRepository (Postgres) participant AS as AuthorizationService participant GW as AgentexAuthProxy C->>TS: create_task(agent, name, ...) TS->>TS: read principal_context → creator_user_id / creator_service_account_id TS->>TR: repository.create(task + creator columns) TR-->>TS: task_entity (committed) alt "FGAC_TASKS_DUAL_WRITE=on" TS->>AS: "register_resource(task, parent=agent)" loop retry up to 3x on transient error AS->>GW: POST /v1/authz/register GW-->>AS: 200 OK / transient error end AS-->>TS: success / exhausted → raise end TS-->>C: task_entity C->>TS: delete_task(id or name) alt name-only + flag on TS->>TR: get(name) → task.id end TS->>TR: repository.delete(id, name) TR-->>TS: deleted alt "FGAC_TASKS_DUAL_WRITE=on and id resolved" TS->>AS: deregister_resource(task) loop retry up to 3x on transient error AS->>GW: POST /v1/authz/deregister GW-->>AS: 200 OK / transient error end end TS-->>C: donePrompt To Fix All With AI
Reviews (4): Last reviewed commit: "feat(AGX1-274): task FGAC dual-write cal..." | Re-trigger Greptile